Learning by message-passing in networks of discrete synapses 
We show that a message-passing process allows one to store in binary "material" synapses a number
of random patterns which almost saturates the information theoretic bounds. We apply the learning
algorithm to networks characterized by a wide range of different connection topologies and of size
comparable with that of biological systems (e.g. $n \simeq 10^5$ to $10^6$). The algorithm can be turned
into an on-line, fault-tolerant learning protocol of potential interest in modeling aspects of synaptic
plasticity and in building neuromorphic devices.
Learning and memory are implemented in neural systems mostly through distributed changes of
synaptic efficacy [1]. The learning problem in neural networks (NN) asks whether one can find
values for the synaptic efficacies such that a set of $p$ patterns is stored simultaneously.
Depending on the structure of the network (feed-forward or recurrent), the storage problem is
seen either as a classification problem (input patterns are classified according to the output
of the network) or as an attractor dynamics problem (patterns are the external stimuli which
drive the dynamics of the network to the closest attractor) [2]. In any case, understanding
the mechanisms underlying synaptic changes constitutes a crucial step for modeling real neural
circuits (e.g. the Purkinje cells in the cerebellum [3]). On the purely theoretical side many
basic results have been derived, ranging from information theoretic bounds [4, 5] and statistical
physics analyses of learning capabilities [6] in model NN to concrete algorithms, such as
artificial pattern recognition systems. Still, many open conceptual problems remain, related to
the need to satisfy realistic constraints [7]. Modeling material synapses is possibly
one of the most basic ones, the discrete case (and specifically the switch-like binary one)
being of particular experimental [8] and technological interest [9]: recent experiments at the
single-synapse resolution level have shown that some synapses undergo potentiation or depression
between a restricted number of discrete stable states through switch-like unitary events. It has
been known for many years that the discreteness of synaptic efficacies makes the learning problem
extraordinarily difficult [10]: even the task of finding binary synaptic weights for a
single-layer network (the binary perceptron) which classifies a given set of patterns into two
classes is both NP-complete and computationally hard on average (as observed in classical
numerical experiments). Although binary networks can in principle correctly classify an extensive
number $p = \alpha n$ of random patterns with $n$ binary synapses [11], in practice no known
algorithm is able to store exactly more than a logarithmic number of patterns [12] once a
sub-exponential cutoff is imposed on the running time.



Here we present a distributed message-passing algorithm of statistical physics origin which is
able to store efficiently an extensive number ($p = \alpha n$ with $\alpha > 0$) of random
patterns in binary NN characterized by a wide range of different topologies. We consider single
and multi-layer networks with local connectivities of the neurons ranging from finite to
extensive. The typical computational complexity of the algorithm will be shown to scale roughly
as $O(n^2 \log n)$, that is, almost linearly in the size of the input for an extensive number of
patterns.
This fact, together with the parallel nature of the algorithm, makes it easy to find optimal
synaptic weights for systems as large as $n \simeq 10^6$ with $\alpha$ relatively close to the
critical value $\alpha_c$ above which perfect learning is no longer possible. From the
algorithmic viewpoint, our solution to the binary learning problem should be seen as an example
of the solution of constraint satisfaction problems over dense factor graphs (a graphical
representation of combinatorial constraints used in information theory [13-15]).
As such, our result shows how the recent progress in combinatorial optimization by statistical
physics and message-passing techniques, which has made it possible to solve efficiently famous
combinatorial problems like random K-satisfiability [16] or random graph Q-coloring [17], can
be extended to other classes of problems in which constraints involve an extensive number of
variables.

FIG. 1: A non-overlapping two-layer network with six synapses (empty circles) and three
threshold units (filled dots), and its corresponding factor graph for four patterns (right).
The factor graph is composed of variable nodes (circles; indices i, j, o in the text) and
function nodes (squares; indices a, b in the text); messages "travel" over the edges of the
factor graph in both directions. Note that while synaptic weights have a unique corresponding
variable node on the factor graph, each of the two auxiliary variables computing a partial
threshold (hidden units), being pattern-dependent, must be replicated for every pattern on the
factor graph.

The NN models that we shall consider are composed of simple threshold units connected by binary
weights $w_{j\ell} = \pm 1$. For the sake of simplicity we consider two-layer networks with one
output unit and with output-layer weights fixed to $w_{\ell,\mathrm{out}} = 1$ (see Fig. 1).
Each of the $K$ internal units is connected to $c_\ell$ inputs, either in a tree-like structure
or in an overlapping way. We will consider NN with connectivities ranging from finite to
extensive, i.e. we take $c_\ell = O(n^\epsilon)$ with $\epsilon \in [0, 1]$. In order to keep
the overall number of synapses extensive we choose $K \propto n / \langle c \rangle$, where
$\langle c \rangle$ is the average connectivity. Under these conditions, the information
theoretic bounds on the maximum number of bits which can be stored in the binary synapses are
compatible with the exact storage of an extensive number of patterns ($p = \alpha n$,
$\alpha > 0$) [11]. The output $\tau_\ell$ of each internal unit is just the sign of the
weighted sum of its inputs $\xi_j$ minus some threshold,
$\tau_\ell = \mathrm{sign}\left(\sum_{j \in V(\ell)} w_{j\ell}\,\xi_j - \gamma_\ell\right)$,
where $V(\ell)$ is the set of inputs connected to unit $\ell$. The overall output $\sigma$ of
the network is given by
$\sigma(\xi) = \mathrm{sign}\left(\sum_{\ell=1}^{K} \tau_\ell - \gamma_{\mathrm{out}}\right)$.
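As an illustration of the network definition above, the following is a minimal sketch of the output of a tree-like two-layer network with output weights fixed to 1. The function name `tree_network_output`, the assignment of consecutive input blocks to hidden units, the use of a single common hidden threshold, and the convention sign(0) = +1 are assumptions made for the example, not choices taken from the text.

```python
import numpy as np

def sign(x):
    """Sign with the (assumed) convention sign(0) = +1, so outputs are always +1 or -1."""
    return np.where(x >= 0, 1, -1)

def tree_network_output(w, xi, K, gamma_hidden=0.0, gamma_out=0.0):
    """Output sigma(xi) of a tree-like two-layer network with K hidden units.

    w  : array of n binary weights (+1/-1); n must be divisible by K
    xi : array of n binary inputs (+1/-1)
    Hidden unit l reads the l-th block of n // K consecutive inputs
    (an illustrative choice of V(l)); output weights are fixed to 1.
    """
    w = np.asarray(w).reshape(K, -1)
    xi = np.asarray(xi).reshape(K, -1)
    tau = sign((w * xi).sum(axis=1) - gamma_hidden)  # hidden outputs tau_l
    sigma = int(sign(tau.sum() - gamma_out))         # overall output
    return sigma, tau

w = np.array([1, -1, 1, 1, 1, -1, -1, -1, 1])
xi = np.array([1, 1, 1, -1, 1, 1, 1, 1, -1])
sigma, tau = tree_network_output(w, xi, K=3)
```

Setting `K=1` reduces the same function to the binary perceptron; odd values of `K` and of the block size avoid ties when the thresholds are zero.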

For $K = 1$ and $c = n$ we recover the binary perceptron, which is the elementary building block
of many NN models. In the case of random input patterns, statistical mechanics and rigorous
methods [?] have made it possible to study the typical behaviour of this type of system in the
limit of large $n$. For instance, the storage capacity $\alpha_c$ has been computed for
different finite values of $K$. Interestingly enough, the general scenario for binary networks
is that while the storage capacity is indeed extensive, the geometric structure of the space of
solutions in the satisfiable region $\alpha < \alpha_c$ is rather complex [18]. Optimal synaptic
configurations are typically far apart in Hamming distance and coexist with an exponential
number of sub-optimal configurations in which an extensive number of errors are made.
Sub-optimal states act as dynamical traps for learning algorithms [13]. Here
we first show how the so-called belief propagation (BP) equations [?] (a variant of the Bethe
approximation in statistical physics) can be applied to single problem instances, providing
useful information such as the entropy of solutions, in agreement with statistical physics
results in the large $n$ limit. Next we modify the equations by introducing a local
reinforcement term which forces the system to polarize to a single optimal configuration of
synaptic weights, effectively turning BP into a solver for this problem.

For simplicity let us fix a threshold value $\gamma$ and first consider a perceptron with binary
weights $w_i \in \{-1, 1\}$ for $i = 1, \ldots, n$. Given an input pattern $\xi$, the binary
perceptron is an elementary device which just computes the function
$f_w(\xi) = \mathrm{sign}\left(\sum_i w_i \xi_i - \gamma\right) \in \{-1, 1\}$. Patterns $\xi$
are thus classified by the perceptron, according to its output, into the two preimage sets of
the function $f_w$. Given two sets of random patterns $\Xi_\pm$, we want to find a vector of
synaptic weights $w$ such that $f_w(\Xi_\pm) = \pm 1$.
Consider the uniform probability space over the set $W$ of all optimal assignments. We are
interested in single marginals, that is, the probabilities $P(w_i = \pm 1)$ that the single
synapses take a certain binary value. Under some weak-correlation assumptions, it is possible
to write a closed set of equations for these quantities. Such BP equations provide results
which are believed to be exact in certain classes of problems defined over sparse factor graphs
in which the size of loops tends to infinity with the problem size (e.g. in low density parity
check codes [15]). In
the case of problems corresponding to highly connected factor graphs (like the learning problem
we discuss here) the validity of the BP approach relies on an apparently stronger condition,
the so-called clustering hypothesis, in which the weak-correlation condition arises from the
weak effective interactions among variables. Until recently no algorithmic approach existed for
studying the properties of a given problem instance of this type. Previous attempts in this
direction were based on iterations of the mean-field TAP equations, which turn out to diverge
in most cases. Recently BP has been used to study some densely connected problems for which it
was shown that the BP equations converge while the TAP equations do not, even though the two
have the same fixed point [20].
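To make the target quantities concrete, the sketch below computes them by exhaustive enumeration for a very small perceptron: the solution set, the single marginals $P(w_i = +1)$ under the uniform measure over solutions, and the entropy per synapse. This is not the BP procedure discussed in the text; it is a brute-force reference that is only feasible for small $n$, and the function name `solution_statistics`, the natural-logarithm normalization of the entropy, and the tie convention sign(0) = +1 are assumptions of the example.

```python
import itertools
import math
import numpy as np

def solution_statistics(patterns, labels, gamma=0.0):
    """Enumerate all 2^n binary weight vectors and keep those classifying every pattern correctly.

    Returns (number of solutions, P(w_i = +1) for each i, entropy per synapse),
    or (0, None, None) when no weight vector satisfies all patterns.
    """
    patterns = np.asarray(patterns)
    labels = np.asarray(labels)
    n = patterns.shape[1]
    solutions = []
    for w in itertools.product((-1, 1), repeat=n):
        fields = patterns @ np.array(w) - gamma
        outputs = np.where(fields >= 0, 1, -1)
        if np.array_equal(outputs, labels):
            solutions.append(w)
    if not solutions:
        return 0, None, None
    sols = np.array(solutions)
    p_plus = (sols == 1).mean(axis=0)          # single marginals P(w_i = +1)
    entropy = math.log(len(solutions)) / n     # log of the solution count, per synapse
    return len(solutions), p_plus, entropy

patterns = [[1, 1, 1], [1, -1, 1]]
labels = [1, 1]
count, p_plus, entropy = solution_statistics(patterns, labels)
```

In this toy instance only the weight vectors (1, 1, 1) and (1, -1, 1) satisfy both patterns, so the first and third synapses are fully polarized while the second is unbiased.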

In contrast with statistical mechanics results, where the average over the patterns and the
limit $n \to \infty$ are taken, here we are interested in single problem instances. Thanks to
the concentration of measure of the error-energy function, the so-called self-averaging
property, we expect the quantities estimated by the equations on single problem instances to
match the typical case once $n$ is large enough. Although the approximations behind BP become
exact only as $n$ gets large, the results at finite $n$ already provide very good approximations
which can be used for algorithmic purposes (see Fig. 2). A large-$n$ expansion of the BP
equations for the $K = 1$, $\gamma = 0$ network learning problem reads:
where $f(a, b)$ is defined through a Gaussian integral of the form
$\int^{\infty} \exp\left(-\frac{(\ldots)^2}{2(1-b)}\right) dx$. At the fixed point,
$m_{i \to a}$ represents the mean value of $w_i$ over the set of synaptic weight configurations
satisfying all patterns except pattern $\xi^a$. The quantity $h_{i \to a}$ is referred to as
the local field that synapse $i$ feels in the absence of pattern $a$.

FIG. 2: BP entropy vs. $\alpha$ for single problem instances of size $n = 3465$ for
$K = 1, 3, 5, 7$. The analytic results for $K = 1$ and $K \gg 1$ for $n \to \infty$ are also
plotted for comparison. The upper inset shows $Q^t$ vs. $t$ for the analytical DE prediction
(dashed line) and for simulations on a system of size $10^5 + 1$ at $\alpha = 0.6$, without
reinforcement (data in perfect agreement with the prediction) and with reinforcement
($\gamma_0 = 0$). The bottom inset shows the fraction of errors $E/n$ vs. $t$ for both cases.
In the latter case we can see that $Q^t \to 0$ as the solution is reached.

The fixed point of these equations provides
the information we are seeking. Solving the equations by iteration proved to be an efficient,
fully distributed technique, known as a message-passing method (the components of the vectors
$u$ and $h$ can be thought of as messages running along the edges of the factor graph, see
Fig. 1). From the fixed point we may compute the list of all probability marginals
$P(w_i = \pm 1)$ together with global quantities of interest such as the entropy (the
normalized logarithm of the size of the set $W$). As expected from the statistical mechanics
results, the entropy is monotonically decreasing with $\alpha$ and vanishes at
$\alpha_c \simeq 0.833$ for $n$ large enough. Similar results can be derived for multilayer
networks, as shown in Fig. 2. The
BP equations can be adapted in a straightforward way to networks of arbitrary topology, even if
the notation becomes slightly more cumbersome. In general such a network will be formed by
connecting several perceptron sub-units. The corresponding factor graph can be recovered
trivially, as in Fig. 1, by just replicating every perceptron for each pattern and adding a set
of auxiliary units to represent the output of every perceptron sub-unit of the network. It then
suffices to derive a set of slightly more general BP equations for the perceptron, which we
omit for the sake of brevity. We have studied analytically the dynamical behaviour of the BP
algorithm in the large $n$ limit by the so-called density evolution (DE) technique (see e.g.
[20] for details on DE). In the upper inset of Figure 2
we can see the comparison of numerical simulations of large single instances with the
analytical prediction of the quantity
$Q^t = 1 - \frac{1}{\alpha n^2} \sum_{i,a} \left(m^t_{i \to a}\right)^2$ at every iteration
step. In the spirit of [16], one way of using the information provided by BP is to "decimate"
the problem. This approach is indeed feasible and leads to optimal assignments. However, here
we focus on a much more efficient and fully distributed version [21] of the algorithm. The
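idea is to introduce an extra term into Eqs. (1-3), enforcing $h_i = \pm\infty$ at a fixed
point, and to use $w_i = \mathrm{sign}(h_i)$ as a solution. This term is introduced
stochastically (with probability 0 at the first iteration and probability 1 at $t = \infty$)
to improve convergence. We will replace Eq. (3) with Eqs. (4, 5):

$$h^{t+1}_{i \to a} = \frac{1}{\sqrt{n}} \sum_{b \neq a} \xi^b_i u^t_{b \to i} + \begin{cases} 0 & \text{w.p. } \gamma_t \\ h^t_i & \text{w.p. } 1 - \gamma_t \end{cases} \qquad (4)$$

$$h^{t+1}_i = h^{t+1}_{i \to a} + \frac{1}{\sqrt{n}} \xi^a_i u^t_{a \to i} \qquad (5)$$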

We will use $\gamma_t = \gamma_0^t$ for $0 < \gamma_0 < 1$ (though other choices are also
possible). Choosing $\gamma_0 = 1$ clearly gives back the original BP set of equations,
Eqs. (1-3). We note that a similar inertia term $\gamma h^t$ (constant $\gamma$) was introduced
in [22], which would correspond to averaging the one in Eq. (4). Note also that the extra term
for $\gamma_t = 0$ corresponds to adding an external field equal to the local field computed in
the last step. Remembering that "fixing" a
variable as in the standard decimation procedure is equivalent to adding an external field of
infinite intensity, one can think of this procedure as a sort of smooth decimation in which all
variables (not only the most polarized ones) get an external field, with an intensity
proportional to their polarization. Numerical experiments on learning randomly generated
patterns have been carried out on systems of various sizes (up to $n = 10^6$), with different
choices of $K$ and with different topologies (overlapping and tree-like). Some are reported in
Fig. 3. An easy-to-use version of the code is made available at [23]. It is not hard to see how
the same algorithm could also be made effective in the presence of faulty contacts and
heterogeneous discrete synaptic values (which need not be identified a priori, as the
message-passing procedure, distributed over the same graph, could incorporate defects by
modifying the messages accordingly). Even in the limiting case of continuous synapses the
process converges to optimal solutions in a wide range of $\alpha$.
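The stochastic reinforcement described above can be sketched in a few lines. The snippet only illustrates how the extra field might be sampled with the schedule $\gamma_t = \gamma_0^t$; it is not the released code, and the function names, the independent draw per synapse, and the definition of the polarization measure as one minus the mean squared magnetization are assumptions made for the example.

```python
import numpy as np

def reinforcement_term(h, t, gamma0, rng):
    """Extra field added at iteration t: h_i with probability 1 - gamma0**t, else 0.

    Assumption: the random draw is made independently for every synapse i.
    """
    gamma_t = gamma0 ** t
    keep = rng.random(h.shape) >= gamma_t   # True with probability 1 - gamma_t
    return np.where(keep, h, 0.0)

def polarization_gap(m):
    """1 - mean(m^2): equals 1 for unbiased messages and 0 when all are fully polarized."""
    m = np.asarray(m, dtype=float)
    return 1.0 - np.mean(m ** 2)

rng = np.random.default_rng(0)
h = np.array([0.5, -2.0, 3.0])
no_reinforcement = reinforcement_term(h, t=5, gamma0=1.0, rng=rng)    # plain BP limit
full_reinforcement = reinforcement_term(h, t=1, gamma0=0.0, rng=rng)  # previous field always added
```

With `gamma0 = 1` the term never appears, which corresponds to the unmodified equations, while with `gamma0 = 0` it is present at every iteration after the first.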

FIG. 3: Learning curves for $\alpha n$ pseudo-random patterns for the binary perceptron, for
different values of $\gamma_0$ ($n = 10^4 + 1$, 20 samples). The running time scales with
$\gamma_0$ roughly as $1/(1 - \gamma_0)$. Inset: evolution of $Q^t$ and $E^t$ vs. time $t$ for
various kinds of two-layer network topologies, i.e. $n = 3^7$, $\alpha = 0.5$ and
$K \in \{3^0, 3^1, \ldots, 3^6\}$. Note that the number of errors $E$ goes to 0 in all cases.

Experiments have been performed using an improved version of Eqs. (1-3). Using further
linearizations as in [20], one can obtain a new set of equations that are equivalent to
Eqs. (1-3) up to an error of $O(n^{-1/2})$ and that have two main implementation advantages:
memory requirements of just $O(n)$ (in addition to the set of patterns, which amounts to
$\alpha n^2$ bits), and a cost of just $O(n)$ (slow) hyperbolic function computations in
addition to $O(n^2)$ elementary (fast) floating point operations. BP equations can also be
simplified by approximating $m_{k \to b}$ by $m_k$ in Eqs. (1-3) (without correction terms),
giving a simple closed expression in the quantities $\{m_k\}$. The resulting equation is no
longer asymptotically equivalent to BP (although the approximation itself has an error of
$O(n^{-1/2})$, it enters a sum of $n$ terms), but it nonetheless gives comparable (just
slightly worse) algorithmic performance. Of particular interest are the corresponding equations
for $\gamma_0 = 0$ (full reinforcement), which take a simple additive form if written in terms
of the local fields $h_i^t$:

$$h^{t+1}_i = \frac{1}{\sqrt{n}} \sum_{s \le t} \sum_b \xi^b_i u^s_b \;\sim\; h^{\tau+1}_i = h^\tau_i + \frac{1}{\sqrt{n}} \xi^{b_\tau}_i u^\tau_{b_\tau} \qquad (6)$$

where $u^\tau_b = f\left(\frac{1}{\sqrt{n}} \sum_k \xi^b_k \tanh h^\tau_k,\; \frac{1}{n} \sum_k \tanh^2 h^\tau_k\right)$
and $t$ scales as $\alpha n \tau$. By choosing at time $\tau$ one pattern $\xi^{b_\tau}$ from
the set $\Xi$, Eq. (6) implements a sequential learning protocol, still leading to an extensive
memory capacity (around $\alpha_{\max} \simeq 0.5$ for the binary perceptron). The simplicity
of Eq. (6) represents a proof of concept of how highly non-trivial learning can take place by
message-passing between simple devices arranged over the network itself. This fact could shed
some light on the biological treatment of information in neural systems [24].
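The structure of the sequential protocol can be sketched as follows: each presented pattern produces a single scalar response, computed from the current magnetizations $\tanh h_k$, which is then added to every local field with the sign of the corresponding pattern entry. The response function `f` is passed in as an argument because its explicit form is not reproduced here; the names `sequential_step` and `learn_online`, the $1/\sqrt{n}$ normalization of the increment, and the convention that every pattern has already been multiplied by its desired output are assumptions of this sketch.

```python
import numpy as np

def sequential_step(h, xi_b, f):
    """One on-line update of the local fields h after presenting pattern xi_b.

    h    : array of n local fields
    xi_b : array of n pattern entries (+1/-1), assumed pre-multiplied by the desired output
    f    : callable f(a, b) -> float, the response function (supplied by the caller)
    Assumption: the increment carries a 1/sqrt(n) normalization.
    """
    n = h.size
    m = np.tanh(h)                  # current magnetizations m_k = tanh(h_k)
    a = xi_b @ m / np.sqrt(n)       # first argument of f
    b = np.mean(m ** 2)             # second argument of f
    return h + xi_b * f(a, b) / np.sqrt(n)

def learn_online(patterns, f, sweeps=1):
    """Present the patterns one at a time and return the binary weights sign(h)."""
    patterns = np.asarray(patterns)
    h = np.zeros(patterns.shape[1])
    for _ in range(sweeps):
        for xi_b in patterns:
            h = sequential_step(h, xi_b, f)
    return np.where(h >= 0, 1, -1)

# Hypothetical placeholder response, used only to exercise the update arithmetic.
constant_f = lambda a, b: 1.0
xi = np.array([1, -1, 1, 1])
h_after = sequential_step(np.zeros(4), xi, constant_f)
w = learn_online([xi], constant_f)
```

Each synapse only needs its own field, its own pattern entry and the scalar response broadcast for the current pattern, which is what makes the rule local and on-line.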